PPO1 (PPO-Clip) — low-level PyTorch implementation#

Goal: implement the classic clipped surrogate objective version of Proximal Policy Optimization (often referred to as PPO1 in older codebases) using plain PyTorch (no RL libraries), and visualize:

  • policy probability ratios (r_t) and clipping behavior (Plotly)

  • learning curves and reward per episode (Plotly)

This notebook is designed to be offline-friendly and runs on CartPole-v1 (Gymnasium).

Notebook roadmap#

  1. PPO1 objective: intuition + the clipped surrogate (LaTeX)

  2. A minimal PyTorch actor-critic

  3. Rollout collection + GAE((\gamma,\lambda))

  4. PPO clipped update (multiple epochs + minibatches)

  5. Plotly visualizations: ratios, clipping, reward per episode

  6. Stable-Baselines PPO1 reference implementation (web research)

  7. Hyperparameters (what they do + tuning tips)

Prerequisites#

  • Python + PyTorch

  • Gymnasium (gymnasium)

  • Plotly

Everything is self-contained (no downloads).

import math
import random
from dataclasses import dataclass

import numpy as np
import pandas as pd

import plotly
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

import torch
import torch.nn as nn
import torch.nn.functional as F

import gymnasium as gym

pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

print("torch:", torch.__version__)
print("gymnasium:", gym.__version__)
print("plotly:", plotly.__version__)
torch: 2.7.0+cu126
gymnasium: 1.1.1
plotly: 6.5.2
# --- Reproducibility ---
SEED = 7
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# --- Run configuration ---
FAST_RUN = True  # set False for longer training

ENV_ID = "CartPole-v1"

ROLLOUT_STEPS = 512 if FAST_RUN else 2048
N_UPDATES = 40 if FAST_RUN else 200
TOTAL_TIMESTEPS = N_UPDATES * ROLLOUT_STEPS

UPDATE_EPOCHS = 4
MINIBATCH_SIZE = 128

GAMMA = 0.99
GAE_LAMBDA = 0.95

CLIP_EPS = 0.2
LEARNING_RATE = 3e-4
ADAM_EPS = 1e-5

ENT_COEF = 0.0
VF_COEF = 0.5
MAX_GRAD_NORM = 0.5

# Extra logging
LOG_EVERY_UPDATES = 1

# Device (suppress noisy CUDA init warnings in restricted environments)
import warnings

warnings.filterwarnings('ignore', message='CUDA initialization:.*')

cuda_ok = bool(torch.cuda.is_available())
device = torch.device("cuda" if cuda_ok else "cpu")
print("device:", device)
print("updates:", N_UPDATES)
print("total_timesteps:", TOTAL_TIMESTEPS)
device: cpu
updates: 40
total_timesteps: 20480

1) PPO1 / PPO-Clip objective (clipped surrogate)#

PPO maintains a current policy (\pi_{\theta}) and a behavior (old) policy (\pi_{\theta_{\mathrm{old}}}) that generated a batch of data.

Define the probability ratio:

[ r_t(\theta) = \frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)} ]

Let (A_t) be an advantage estimate (commonly GAE). The clipped surrogate objective is:

[ L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\Big( r_t(\theta)\,A_t,\; \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,A_t \Big)\right] ]

Intuition:

  • If (A_t > 0): we want (\pi_\theta) to increase probability of (a_t), but we cap the improvement when (r_t) exceeds (1+\epsilon).

  • If (A_t < 0): we want (\pi_\theta) to decrease probability of (a_t), but we cap the degradation when (r_t) falls below (1-\epsilon).

In code we typically minimize the negative objective: policy_loss = -mean(min(...)).
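The min/clamp mechanics can be checked on a toy minibatch (a standalone sketch; `eps` here plays the role of the notebook's `CLIP_EPS`, and the ratio/advantage values are made up for illustration):

```python
import torch

eps = 0.2  # clip epsilon

# Toy minibatch: probability ratios and (all-positive) advantages.
ratio = torch.tensor([0.5, 0.9, 1.0, 1.3, 1.6])
adv = torch.tensor([1.0, 1.0, 1.0, 1.0, 1.0])

unclipped = ratio * adv
clipped = ratio.clamp(1 - eps, 1 + eps) * adv
# Elementwise min, then negate for gradient descent.
policy_loss = -torch.minimum(unclipped, clipped).mean()

# With A_t > 0, ratios above 1 + eps contribute only (1 + eps) * A_t,
# so the objective gives no extra credit for pushing r_t past the cap.
# Note the min also keeps the *pessimistic* branch for r_t < 1 - eps.
print(policy_loss)  # per-sample terms: [0.5, 0.9, 1.0, 1.2, 1.2]
```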

2) Environment#

We use CartPole-v1 (discrete actions, low-dimensional state). PPO also works with continuous actions; the PPO1 clipping logic is identical.

env = gym.make(ENV_ID)
env.action_space.seed(SEED)

obs_dim = int(np.prod(env.observation_space.shape))
assert isinstance(env.action_space, gym.spaces.Discrete)
action_dim = env.action_space.n

print("obs_dim:", obs_dim)
print("action_dim:", action_dim)
obs_dim: 4
action_dim: 2

3) Low-level PyTorch actor-critic#

We implement:

  • actor: outputs logits for a categorical action distribution

  • critic: outputs state-value (V(s))

No helper RL libraries; just torch.

class ActorCritic(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden, action_dim)
        self.value_head = nn.Linear(hidden, 1)

        # Orthogonal init is common for PPO
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.orthogonal_(m.weight, gain=math.sqrt(2))
                nn.init.constant_(m.bias, 0.0)
        nn.init.orthogonal_(self.policy_head.weight, gain=0.01)

    def forward(self, obs: torch.Tensor):
        x = self.backbone(obs)
        logits = self.policy_head(x)
        value = self.value_head(x).squeeze(-1)
        return logits, value

    @torch.no_grad()
    def act(self, obs: torch.Tensor):
        logits, value = self.forward(obs)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
        return action, log_prob, entropy, value

    def evaluate_actions(self, obs: torch.Tensor, actions: torch.Tensor):
        logits, value = self.forward(obs)
        dist = torch.distributions.Categorical(logits=logits)
        log_prob = dist.log_prob(actions)
        entropy = dist.entropy()
        return log_prob, entropy, value


agent = ActorCritic(obs_dim, action_dim).to(device)
optimizer = torch.optim.Adam(agent.parameters(), lr=LEARNING_RATE, eps=ADAM_EPS)

4) Rollouts + GAE#

We collect an on-policy rollout of length ROLLOUT_STEPS, then compute:

  • advantages (A_t) via Generalized Advantage Estimation (GAE)

  • returns (R_t = A_t + V(s_t))

Finally we do multiple epochs of minibatch optimization on the same rollout.

@dataclass
class Rollout:
    obs: torch.Tensor
    actions: torch.Tensor
    log_probs: torch.Tensor
    values: torch.Tensor
    rewards: torch.Tensor
    dones: torch.Tensor
    advantages: torch.Tensor
    returns: torch.Tensor


def compute_gae(
    rewards: np.ndarray,
    values: np.ndarray,
    dones: np.ndarray,
    last_value: float,
    *,
    gamma: float,
    lam: float,
):
    """GAE for a single-environment rollout."""
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        next_nonterminal = 1.0 - float(dones[t])
        next_value = last_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        adv[t] = gae
    ret = adv + values
    return adv, ret

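A quick sanity check of the recursion (standalone, duplicating the loop above on a made-up 3-step episode): with (\lambda = 1) and a terminal step, the advantage telescopes to the discounted return minus the value baseline (V(s_t)).

```python
import numpy as np

gamma, lam = 0.99, 1.0
rewards = np.array([1.0, 1.0, 1.0], dtype=np.float32)
values = np.array([0.5, 0.4, 0.3], dtype=np.float32)
dones = np.array([False, False, True])
last_value = 0.0  # ignored: the episode terminates at t = 2

# Same backward recursion as compute_gae above.
T = len(rewards)
adv = np.zeros(T, dtype=np.float32)
gae = 0.0
for t in reversed(range(T)):
    nonterminal = 1.0 - float(dones[t])
    next_value = last_value if t == T - 1 else values[t + 1]
    delta = rewards[t] + gamma * next_value * nonterminal - values[t]
    gae = delta + gamma * lam * nonterminal * gae
    adv[t] = gae

# With lam = 1, adv[0] equals the discounted return minus V(s_0).
ret0 = 1.0 + gamma * 1.0 + gamma**2 * 1.0
assert abs(adv[0] - (ret0 - values[0])) < 1e-5
```

Lower (\lambda) shrinks the horizon of the estimate, trading variance for bias.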
def collect_rollout(env, agent: ActorCritic, rollout_steps: int, obs: np.ndarray):
    obs_list = []
    action_list = []
    logp_list = []
    value_list = []
    reward_list = []
    done_list = []

    episode_returns = []
    ep_return = 0.0

    for _ in range(rollout_steps):
        obs_tensor = torch.tensor(obs, dtype=torch.float32, device=device).unsqueeze(0)
        action, logp, entropy, value = agent.act(obs_tensor)

        action_item = int(action.item())
        next_obs, reward, terminated, truncated, _ = env.step(action_item)
        done = bool(terminated or truncated)

        obs_list.append(obs)
        action_list.append(action_item)
        logp_list.append(float(logp.item()))
        value_list.append(float(value.item()))
        reward_list.append(float(reward))
        done_list.append(done)

        ep_return += float(reward)

        obs = next_obs
        if done:
            episode_returns.append(ep_return)
            ep_return = 0.0
            obs, _ = env.reset()

    # Bootstrap value at the end of rollout
    with torch.no_grad():
        obs_tensor = torch.tensor(obs, dtype=torch.float32, device=device).unsqueeze(0)
        _, last_value = agent.forward(obs_tensor)
        last_value = float(last_value.item())

    obs_arr = np.asarray(obs_list, dtype=np.float32)
    actions_arr = np.asarray(action_list, dtype=np.int64)
    logp_arr = np.asarray(logp_list, dtype=np.float32)
    values_arr = np.asarray(value_list, dtype=np.float32)
    rewards_arr = np.asarray(reward_list, dtype=np.float32)
    dones_arr = np.asarray(done_list, dtype=np.bool_)

    adv_arr, ret_arr = compute_gae(
        rewards_arr,
        values_arr,
        dones_arr,
        last_value,
        gamma=GAMMA,
        lam=GAE_LAMBDA,
    )

    # Advantage normalization is a common PPO trick
    adv_arr = (adv_arr - adv_arr.mean()) / (adv_arr.std() + 1e-8)

    rollout = Rollout(
        obs=torch.tensor(obs_arr, dtype=torch.float32, device=device),
        actions=torch.tensor(actions_arr, dtype=torch.int64, device=device),
        log_probs=torch.tensor(logp_arr, dtype=torch.float32, device=device),
        values=torch.tensor(values_arr, dtype=torch.float32, device=device),
        rewards=torch.tensor(rewards_arr, dtype=torch.float32, device=device),
        dones=torch.tensor(dones_arr.astype(np.float32), dtype=torch.float32, device=device),
        advantages=torch.tensor(adv_arr, dtype=torch.float32, device=device),
        returns=torch.tensor(ret_arr, dtype=torch.float32, device=device),
    )

    return rollout, episode_returns, obs

5) PPO1 update step#

For each rollout batch we optimize the clipped surrogate objective over several epochs/minibatches.

We also log ratio statistics so we can visualize clipping.

def ppo_update(agent: ActorCritic, optimizer: torch.optim.Optimizer, rollout: Rollout):
    batch_size = rollout.obs.shape[0]
    b_inds = np.arange(batch_size)

    policy_losses = []
    value_losses = []
    entropies = []
    clip_fracs = []
    approx_kls = []

    for _ in range(UPDATE_EPOCHS):
        np.random.shuffle(b_inds)
        for start in range(0, batch_size, MINIBATCH_SIZE):
            end = start + MINIBATCH_SIZE
            mb_inds = b_inds[start:end]

            obs_b = rollout.obs[mb_inds]
            actions_b = rollout.actions[mb_inds]
            old_logp_b = rollout.log_probs[mb_inds]
            adv_b = rollout.advantages[mb_inds]
            ret_b = rollout.returns[mb_inds]

            new_logp, entropy, value = agent.evaluate_actions(obs_b, actions_b)

            log_ratio = new_logp - old_logp_b
            ratio = log_ratio.exp()

            # PPO clipped surrogate
            unclipped = ratio * adv_b
            clipped = ratio.clamp(1.0 - CLIP_EPS, 1.0 + CLIP_EPS) * adv_b
            policy_loss = -torch.mean(torch.minimum(unclipped, clipped))

            value_loss = F.mse_loss(value, ret_b)
            entropy_mean = torch.mean(entropy)

            loss = policy_loss + VF_COEF * value_loss - ENT_COEF * entropy_mean

            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            nn.utils.clip_grad_norm_(agent.parameters(), MAX_GRAD_NORM)
            optimizer.step()

            # Diagnostics: a simple KL estimator (mean of old - new log-probs);
            # it is noisy and can come out slightly negative.
            with torch.no_grad():
                approx_kl = torch.mean(old_logp_b - new_logp).item()
                clip_frac = torch.mean((torch.abs(ratio - 1.0) > CLIP_EPS).float()).item()

            policy_losses.append(policy_loss.item())
            value_losses.append(value_loss.item())
            entropies.append(entropy_mean.item())
            clip_fracs.append(clip_frac)
            approx_kls.append(approx_kl)

    return {
        "policy_loss": float(np.mean(policy_losses)),
        "value_loss": float(np.mean(value_losses)),
        "entropy": float(np.mean(entropies)),
        "clip_frac": float(np.mean(clip_fracs)),
        "approx_kl": float(np.mean(approx_kls)),
    }
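What the two diagnostics measure can be seen on hand-picked log-probs (a standalone sketch; `eps` stands in for `CLIP_EPS`, and the probabilities are invented):

```python
import torch

eps = 0.2
old_logp = torch.log(torch.tensor([0.50, 0.50, 0.50, 0.50]))
new_logp = torch.log(torch.tensor([0.50, 0.55, 0.65, 0.30]))

ratio = (new_logp - old_logp).exp()  # [1.0, 1.1, 1.3, 0.6]

# clip_frac: fraction of samples whose ratio left the trust band.
clip_frac = torch.mean((torch.abs(ratio - 1.0) > eps).float())  # 2 of 4

# approx_kl: the simple estimator used in ppo_update; can be negative
# for individual minibatches even when the policy barely moved.
approx_kl = torch.mean(old_logp - new_logp)
print(clip_frac.item(), approx_kl.item())
```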

6) Train PPO1 on CartPole#

We train for TOTAL_TIMESTEPS and record:

  • reward per episode

  • PPO diagnostics (losses, clip fraction, KL)

  • a final batch of ratios/advantages for plotting

episode_rewards = []
logs = []

last_ratio_snapshot = None
last_adv_snapshot = None
last_clip_active_snapshot = None

obs, _ = env.reset(seed=SEED)

for update in range(1, N_UPDATES + 1):
    rollout, ep_returns, obs = collect_rollout(env, agent, ROLLOUT_STEPS, obs)
    episode_rewards.extend(ep_returns)

    metrics = ppo_update(agent, optimizer, rollout)

    # Capture ratio/adv snapshots (after the update) for visualization
    with torch.no_grad():
        new_logp, _, _ = agent.evaluate_actions(rollout.obs, rollout.actions)
        ratio = (new_logp - rollout.log_probs).exp()
        adv = rollout.advantages
        clip_active = ((adv >= 0) & (ratio > 1.0 + CLIP_EPS)) | (
            (adv < 0) & (ratio < 1.0 - CLIP_EPS)
        )

        last_ratio_snapshot = ratio.detach().cpu().numpy()
        last_adv_snapshot = adv.detach().cpu().numpy()
        last_clip_active_snapshot = clip_active.detach().cpu().numpy().astype(bool)

    logs.append({"update": update, "timesteps": update * ROLLOUT_STEPS, **metrics, "episodes": len(episode_rewards)})

    if update % LOG_EVERY_UPDATES == 0:
        recent = episode_rewards[-10:]
        recent_mean = float(np.mean(recent)) if recent else float("nan")
        print(
            f"update {update:>3}/{N_UPDATES} | "
            f"episodes={len(episode_rewards):>4} | "
            f"recent_reward_mean(10)={recent_mean:>7.2f} | "
            f"clip_frac={metrics['clip_frac']:.3f} | "
            f"approx_kl={metrics['approx_kl']:.4f}"
        )
update   1/40 | episodes=  21 | recent_reward_mean(10)=  22.40 | clip_frac=0.000 | approx_kl=0.0000
update   2/40 | episodes=  44 | recent_reward_mean(10)=  27.20 | clip_frac=0.000 | approx_kl=0.0002
update   3/40 | episodes=  68 | recent_reward_mean(10)=  18.70 | clip_frac=0.000 | approx_kl=0.0020
update   4/40 | episodes=  85 | recent_reward_mean(10)=  33.00 | clip_frac=0.000 | approx_kl=-0.0003
update   5/40 | episodes= 105 | recent_reward_mean(10)=  31.20 | clip_frac=0.000 | approx_kl=-0.0000
update   6/40 | episodes= 123 | recent_reward_mean(10)=  31.90 | clip_frac=0.000 | approx_kl=-0.0002
update   7/40 | episodes= 142 | recent_reward_mean(10)=  27.60 | clip_frac=0.000 | approx_kl=0.0006
update   8/40 | episodes= 159 | recent_reward_mean(10)=  35.70 | clip_frac=0.000 | approx_kl=0.0001
update   9/40 | episodes= 179 | recent_reward_mean(10)=  23.50 | clip_frac=0.000 | approx_kl=-0.0000
update  10/40 | episodes= 196 | recent_reward_mean(10)=  20.40 | clip_frac=0.000 | approx_kl=0.0001
update  11/40 | episodes= 212 | recent_reward_mean(10)=  35.00 | clip_frac=0.000 | approx_kl=0.0001
update  12/40 | episodes= 229 | recent_reward_mean(10)=  31.80 | clip_frac=0.000 | approx_kl=-0.0005
update  13/40 | episodes= 242 | recent_reward_mean(10)=  39.50 | clip_frac=0.000 | approx_kl=0.0001
update  14/40 | episodes= 255 | recent_reward_mean(10)=  45.10 | clip_frac=0.000 | approx_kl=0.0008
update  15/40 | episodes= 268 | recent_reward_mean(10)=  43.00 | clip_frac=0.000 | approx_kl=0.0001
update  16/40 | episodes= 285 | recent_reward_mean(10)=  26.40 | clip_frac=0.000 | approx_kl=0.0067
update  17/40 | episodes= 300 | recent_reward_mean(10)=  37.20 | clip_frac=0.000 | approx_kl=0.0007
update  18/40 | episodes= 316 | recent_reward_mean(10)=  33.90 | clip_frac=0.002 | approx_kl=0.0029
update  19/40 | episodes= 336 | recent_reward_mean(10)=  26.90 | clip_frac=0.000 | approx_kl=-0.0001
update  20/40 | episodes= 353 | recent_reward_mean(10)=  29.70 | clip_frac=0.000 | approx_kl=0.0003
update  21/40 | episodes= 367 | recent_reward_mean(10)=  34.90 | clip_frac=0.000 | approx_kl=0.0001
update  22/40 | episodes= 387 | recent_reward_mean(10)=  27.10 | clip_frac=0.028 | approx_kl=0.0119
update  23/40 | episodes= 406 | recent_reward_mean(10)=  23.80 | clip_frac=0.089 | approx_kl=0.0049
update  24/40 | episodes= 424 | recent_reward_mean(10)=  24.30 | clip_frac=0.006 | approx_kl=0.0045
update  25/40 | episodes= 443 | recent_reward_mean(10)=  28.40 | clip_frac=0.015 | approx_kl=0.0029
update  26/40 | episodes= 459 | recent_reward_mean(10)=  30.90 | clip_frac=0.000 | approx_kl=0.0003
update  27/40 | episodes= 478 | recent_reward_mean(10)=  25.30 | clip_frac=0.021 | approx_kl=0.0051
update  28/40 | episodes= 495 | recent_reward_mean(10)=  36.30 | clip_frac=0.001 | approx_kl=0.0019
update  29/40 | episodes= 510 | recent_reward_mean(10)=  35.20 | clip_frac=0.002 | approx_kl=0.0024
update  30/40 | episodes= 521 | recent_reward_mean(10)=  43.80 | clip_frac=0.006 | approx_kl=0.0060
update  31/40 | episodes= 538 | recent_reward_mean(10)=  30.40 | clip_frac=0.049 | approx_kl=0.0050
update  32/40 | episodes= 556 | recent_reward_mean(10)=  29.40 | clip_frac=0.010 | approx_kl=0.0015
update  33/40 | episodes= 566 | recent_reward_mean(10)=  47.20 | clip_frac=0.002 | approx_kl=0.0062
update  34/40 | episodes= 586 | recent_reward_mean(10)=  23.00 | clip_frac=0.003 | approx_kl=0.0007
update  35/40 | episodes= 595 | recent_reward_mean(10)=  52.80 | clip_frac=0.001 | approx_kl=-0.0004
update  36/40 | episodes= 606 | recent_reward_mean(10)=  49.00 | clip_frac=0.002 | approx_kl=0.0068
update  37/40 | episodes= 621 | recent_reward_mean(10)=  31.80 | clip_frac=0.095 | approx_kl=0.0067
update  38/40 | episodes= 632 | recent_reward_mean(10)=  44.30 | clip_frac=0.000 | approx_kl=0.0029
update  39/40 | episodes= 646 | recent_reward_mean(10)=  32.90 | clip_frac=0.000 | approx_kl=0.0005
update  40/40 | episodes= 657 | recent_reward_mean(10)=  46.00 | clip_frac=0.001 | approx_kl=0.0036
df_logs = pd.DataFrame(logs)
df_logs.head()
update timesteps policy_loss value_loss entropy clip_frac approx_kl episodes
0 1 512 -0.003087 89.048697 0.692916 0.0 0.000009 21
1 2 1024 -0.001868 88.419497 0.691513 0.0 0.000157 44
2 3 1536 -0.004179 78.124797 0.688415 0.0 0.002013 68
3 4 2048 -0.001415 110.220530 0.686069 0.0 -0.000294 85
4 5 2560 -0.001063 96.217341 0.683129 0.0 -0.000024 105

7) Plotly: reward per episode (learning curve)#

This is the most direct signal for whether the policy is improving.

df_ep = pd.DataFrame({"episode": np.arange(len(episode_rewards)), "reward": episode_rewards})

window = 20
if len(df_ep) >= window:
    df_ep["reward_ma"] = df_ep["reward"].rolling(window).mean()

fig = px.line(df_ep, x="episode", y="reward", title="CartPole reward per episode")
if "reward_ma" in df_ep.columns:
    fig.add_trace(
        go.Scatter(x=df_ep["episode"], y=df_ep["reward_ma"], name=f"MA({window})")
    )
fig.update_layout(xaxis_title="Episode", yaxis_title="Total reward")
fig.show()

8) Plotly: PPO diagnostics over updates#

We visualize clipping behavior and losses over training updates.

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_logs["update"], y=df_logs["clip_frac"], name="clip_frac"))
fig.add_trace(go.Scatter(x=df_logs["update"], y=df_logs["approx_kl"], name="approx_kl"))
fig.update_layout(title="PPO diagnostics", xaxis_title="Update", yaxis_title="Value")
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_logs["update"], y=df_logs["policy_loss"], name="policy_loss"))
fig.add_trace(go.Scatter(x=df_logs["update"], y=df_logs["value_loss"], name="value_loss"))
fig.add_trace(go.Scatter(x=df_logs["update"], y=df_logs["entropy"], name="entropy"))
fig.update_layout(title="Losses over updates", xaxis_title="Update", yaxis_title="Loss / entropy")
fig.show()

9) Plotly: policy ratios (r_t) and clipping#

Below we plot the distribution of (r_t) and highlight where clipping is active.

  • The histogram should concentrate near 1.0.

  • As training progresses, some mass moves outside ([1-\epsilon, 1+\epsilon]), but PPO discourages large deviations.

ratios = last_ratio_snapshot

fig = go.Figure()
fig.add_trace(go.Histogram(x=ratios, nbinsx=60, name="r_t"))
fig.add_vline(x=1.0 - CLIP_EPS, line_dash="dash", line_color="orange")
fig.add_vline(x=1.0, line_dash="dash", line_color="gray")
fig.add_vline(x=1.0 + CLIP_EPS, line_dash="dash", line_color="orange")
fig.update_layout(
    title="Policy ratio distribution (last rollout)",
    xaxis_title="r_t = pi_new(a|s) / pi_old(a|s)",
    yaxis_title="Count",
)
fig.show()
df_ratio = pd.DataFrame(
    {
        "ratio": last_ratio_snapshot,
        "advantage": last_adv_snapshot,
        "clip_active": last_clip_active_snapshot,
    }
)

fig = px.scatter(
    df_ratio,
    x="ratio",
    y="advantage",
    color="clip_active",
    title="Where clipping is active (last rollout)",
    labels={"ratio": "r_t", "advantage": "A_t"},
)
fig.add_vline(x=1.0 - CLIP_EPS, line_dash="dash", line_color="orange")
fig.add_vline(x=1.0, line_dash="dash", line_color="gray")
fig.add_vline(x=1.0 + CLIP_EPS, line_dash="dash", line_color="orange")
fig.show()

10) Stable-Baselines PPO1 (web research)#

A Stable-Baselines implementation of PPO1 exists (legacy TensorFlow 1.x codebase):

  • Repo: https://github.com/hill-a/stable-baselines

  • PPO1 package: https://github.com/hill-a/stable-baselines/tree/master/stable_baselines/ppo1

  • Main file: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/ppo1/pposgd_simple.py

    • Exposes class PPO1(...) (imported by stable_baselines/ppo1/__init__.py)

The original OpenAI Baselines PPO implementation is also available:

  • https://github.com/openai/baselines/tree/master/baselines/ppo1

Example usage (not run here):

from stable_baselines import PPO1
import gym

env = gym.make("CartPole-v1")
model = PPO1("MlpPolicy", env, clip_param=0.2, timesteps_per_actorbatch=2048)
model.learn(total_timesteps=1_000_000)

Note: Stable-Baselines is archived/legacy and uses TF1/MPI; Stable-Baselines3 is PyTorch and offers PPO (conceptually closer to PPO2-style implementations).

11) Stable-Baselines PPO1 hyperparameters (explained)#

Stable-Baselines PPO1 (legacy TensorFlow/MPI) exposes the following constructor signature (from stable_baselines/ppo1/pposgd_simple.py):

PPO1(
    policy,
    env,
    gamma=0.99,
    timesteps_per_actorbatch=256,
    clip_param=0.2,
    entcoeff=0.01,
    optim_epochs=4,
    optim_stepsize=1e-3,
    optim_batchsize=64,
    lam=0.95,
    adam_epsilon=1e-5,
    schedule='linear',
    verbose=0,
    tensorboard_log=None,
    _init_setup_model=True,
    policy_kwargs=None,
    full_tensorboard_log=False,
    seed=None,
    n_cpu_tf_sess=1,
)

What each hyperparameter does#

  • policy: policy class (or registered string) like MlpPolicy, CnnPolicy, etc.

  • env: Gym env instance or an env id string (e.g. 'CartPole-v1').

  • gamma: discount factor \(\gamma\).

  • timesteps_per_actorbatch: number of environment steps collected per update per actor (batch size).

  • clip_param: PPO clip parameter \(\epsilon\).

  • entcoeff: entropy coefficient (larger → more exploration pressure).

  • optim_epochs: number of epochs over the on-policy batch per update.

  • optim_stepsize: optimizer step size (learning rate), optionally controlled by schedule.

  • optim_batchsize: minibatch size.

  • lam: GAE(\(\lambda\)) parameter.

  • adam_epsilon: Adam epsilon for numerical stability.

  • schedule: learning-rate schedule type (e.g. 'linear', 'constant', …).

Mapping to this notebook#

  • SB timesteps_per_actorbatch → this notebook’s ROLLOUT_STEPS

  • SB clip_param → CLIP_EPS

  • SB entcoeff → ENT_COEF

  • SB optim_epochs → UPDATE_EPOCHS

  • SB optim_stepsize → LEARNING_RATE

  • SB optim_batchsize → MINIBATCH_SIZE

  • SB gamma → GAMMA

  • SB lam → GAE_LAMBDA

  • SB adam_epsilon → ADAM_EPS

  • SB schedule → not implemented here (easy extension: linearly decay LEARNING_RATE over updates)
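The missing schedule could be added with a few lines. A minimal sketch (the helper `linear_lr` and its defaults are mine, not Stable-Baselines'; the commented lines show where it would hook into the training loop above):

```python
def linear_lr(update: int, n_updates: int, lr0: float = 3e-4) -> float:
    """Linearly decay from lr0 at update 1 toward 0 at the final update."""
    frac = 1.0 - (update - 1) / n_updates  # 1.0 down to 1/n_updates
    return lr0 * frac

# Inside the training loop, before each ppo_update call:
# for g in optimizer.param_groups:
#     g["lr"] = linear_lr(update, N_UPDATES)
print(linear_lr(1, 40), linear_lr(40, 40))
```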

Practical tuning hints#

  • If reward collapses: reduce LEARNING_RATE, reduce UPDATE_EPOCHS, or reduce CLIP_EPS.

  • If learning is slow: increase ROLLOUT_STEPS, increase UPDATE_EPOCHS, or slightly increase LEARNING_RATE.

  • Watch approx_kl and clip_frac: sustained high values mean policy updates are too aggressive.
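One common guard against runaway updates is stopping the epoch loop once the measured KL exceeds a budget (similar in spirit to Stable-Baselines3's `target_kl` option). A toy sketch of the control flow, with `TARGET_KL` and the helper being illustrative choices of mine:

```python
TARGET_KL = 0.02  # assumed threshold; tune per task

def run_epochs(kl_per_epoch):
    """Return how many epochs actually run before the KL stop triggers."""
    done = 0
    for kl in kl_per_epoch:  # kl would be the mean approx_kl of that epoch
        done += 1
        if kl > TARGET_KL:
            break  # policy moved too far; stop optimizing this rollout
    return done

# Example: the third epoch overshoots the budget, so the fourth is skipped.
print(run_epochs([0.005, 0.012, 0.031, 0.004]))
```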

Pitfalls + exercises#

  • If training is unstable: lower LEARNING_RATE, check advantage normalization, and verify the done/bootstrapping logic.

  • If clip_frac is near 0.0: updates may be too small (try higher LR or more epochs).

  • If clip_frac is very high: updates are too aggressive (try smaller LR or smaller CLIP_EPS).


Exercises#

  1. Add an entropy bonus (ENT_COEF > 0) and compare learning curves.

  2. Implement value function clipping (as in some PPO variants) and compare critic stability.

  3. Switch to a continuous-action env (e.g., Pendulum) using a Gaussian policy.


References#

  • Schulman et al., Proximal Policy Optimization Algorithms (2017): https://arxiv.org/abs/1707.06347

  • Stable-Baselines PPO1 source (TF1): https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/ppo1/pposgd_simple.py